24 research outputs found

    String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

    Full text link
    Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text TT of length nn, permutes its symbols according to the lexicographic order of suffixes of TT. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length nn, occupying O(n/log⁥n)O(n/\log n) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n)O(n) time and O(n/log⁥n)O(n/\log n) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n)\Omega(n) time. In this paper, we propose the first algorithm that breaks the O(n)O(n)-time barrier for BWT construction. Given a binary string of length nn, our procedure builds the Burrows-Wheeler transform in O(n/log⁥n)O(n/\sqrt{\log n}) time and O(n/log⁥n)O(n/\log n) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(mlog⁥m)O(m\sqrt{\log m})-time solution by Chan and P\v{a}tra\c{s}cu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n/log⁥n)O(n/\log n) that answers Longest Common Extension queries (LCE queries) in O(1)O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/log⁥n)O(n/\log n) time.Comment: Full version of a paper accepted to STOC 201

    Efficient Computation of Sequence Mappability

    Get PDF
    Sequence mappability is an important task in genome re-sequencing. In the (k,m)(k,m)-mappability problem, for a given sequence TT of length nn, our goal is to compute a table whose iith entry is the number of indices j≠ij \ne i such that length-mm substrings of TT starting at positions ii and jj have at most kk mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k=1k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(nmin⁡{mk,log⁡k+1n})\mathcal{O}(n \min\{m^k,\log^{k+1} n\}) time and O(n)\mathcal{O}(n) space for k=O(1)k=\mathcal{O}(1). It requires a carefu l adaptation of the technique of Cole et al.~[STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n2)\mathcal{O}(n^2)-time algorithms to compute all results for a fixed mm and all k=0,
,mk=0,\ldots,m or a fixed kk and all m=k,
,n−1m=k,\ldots,n-1. Finally we show that the (k,m)(k,m)-mappability problem cannot be solved in strongly subquadratic time for k,m=Θ(log⁡n)k,m = \Theta(\log n) unless the Strong Exponential Time Hypothesis fails.Comment: Accepted to SPIRE 201

    Finding the Anticover of a String

    Get PDF
    A k-anticover of a string x is a set of pairwise distinct factors of x of equal length k, such that every symbol of x is contained into an occurrence of at least one of those factors. The existence of a k-anticover can be seen as a notion of non-redundancy, which has application in computational biology, where they are associated with various non-regulatory mechanisms. In this paper we address the complexity of the problem of finding a k-anticover of a string x if it exists, showing that the decision problem is NP-complete on general strings for k ? 3. We also show that the problem admits a polynomial-time solution for k=2. For unbounded k, we provide an exact exponential algorithm to find a k-anticover of a string of length n (or determine that none exists), which runs in O*(min {3^{(n-k)/3)}, ((k(k+1))/2)^{n/(k+1)) time using polynomial space

    IUPACpal: efficient identification of inverted repeats in IUPAC-encoded DNA sequences

    Get PDF
    Background: An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and they have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets. Results: We present IUPACpal, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences allowing also for potential mismatches and gaps in the inverted repeats. Conclusion: Within the parameters that were tested, our experimental results show that IUPACpal compares favourably to a similar application packaged with EMBOSS. We show that IUPACpal identifies many previously unidentified inverted repeats when compared with EMBOSS, and that this is also performed with orders of magnitude improved speed.</p

    Comparing Degenerate Strings

    Get PDF
    Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length ki but this length can vary between different sets. We denote by W the sum of these lengths k0, k1,... , kn-1. Our main result is an (N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in (min{W, n2}N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach

    Dynamic IoT Malware Detection in Android Systems Using Profile Hidden Markov Models

    No full text
    The prevalence of malware attacks that target IoT systems has raised an alarm and highlighted the need for efficient mechanisms to detect and defeat them. However, detecting malware is challenging, especially malware with new or unknown behaviors. The main problem is that malware can hide, so it cannot be detected easily. Furthermore, information about malware families is limited which restricts the amount of “big data” that is available for analysis. The motivation of this paper is two-fold. First, to introduce a new Profile Hidden Markov Model (PHMM) that can be used for both app analysis and classification in Android systems. Second, to dynamically identify suspicious calls while reducing infection risks of executed codes. We focused on Android systems, as they are more vulnerable than other IoT systems due to their ubiquitousness and sideloading features. The experimental results showed that the proposed Dynamic IoT malware Detection in Android Systems using PHMM (DIP) achieved superior performance when benchmarked against eight rival malware detection frameworks, showing up to 96.3% accuracy at 5% False Positive Rate (FP rate), 3% False Negative Rate (FN rate) and 94.9% F-measure
    corecore